How many researchers…?

image of a light bulb

The best thing about being a statistician…

John Tukey

Backyard #1

Children's Mercy Hospital building

Backyard #2

Cleveland Chiropractic College building

Backyard #3

MRI Global building

Backyard #4

North Kansas City Hospital building

Backyard #5

Saint Luke's Hospital building

Backyard #6

Truman Medical Center building

My favorite two backyards

UMKC and KUMC mascots

New backyard: Russ Waitman

Russ Waitman

i2b2 software

Screenshot of i2b2 software

The database structure behind i2b2

Diagram of i2b2 schema

How many surgeries?

## How many surgeries?

select_surgeries <- 
  "SELECT name_char FROM blueherondata.concept_dimension
     WHERE name_char LIKE '%ectomy%'"     
dbGetQuery(c_connect, select_surgeries) %>%   # Extract records
  use_series(NAME_CHAR) %>%                   # Convert to vector
  strsplit(" ") %>%                           # Split into words
  unlist %>%                                  # Re-convert to vector
  tolower %>%                                 # Force to lower case
  grep("ectomy", ., value=TRUE) %>%           # Toss extraneous words
  gsub("[[:punct:]]", "", .) %>%              # Remove punctuation
  gsub("ectomy.*", "-", .) %>%                # Remove ectomy suffix
  unique %>%                                  # Remove duplicates
  sample(100, replace=FALSE) %>%              # Select 100 random
  sort %>%                                    # Arrange
  paste(collapse=", ")                        # Delimit with commas

How many surgeries?

"acromion-, adenoid-, alveol-, apic-, apico-, arthr-, arytenoid-, astragal-, ather-, burs-, capsul-, carp-, clitor-, coccyg-, crani-, dacryoaden-, dacryocyst-, diaphys-, disarticulationhemipelv-, disk-, diverticul-, endarter-, epididym-, epiglottid-, epiplo-, ethmoid-, fasci-, fistul-, frenul-, ganglion-, gastr-, gingiv-, gloss-, hemigastr-, hemigloss-, hemilamin-, hemilaryng-, hemiphalang-, hemorrhoid-, hepat-, hymen-, hyster-, infundibul-, irid-, labyrinth-, lamin-, lip-, lump-, mucos-, my-, myom-, nephr-, nephroureter-, oophor-, osteophyt-, pannicul-, patell-, phalang-, pharyngolaryng-, phleb-, pleur-, plex-, pneumon-, postadenoid-, postcholecyst-, postgastr-, postlymphaden-, postmastoid-, postpolyp-, postprostat-, postsplen-, prostat-, rectosigmoid-, salping-, salpingoophor-, scler-, segment-, sequestr-, sialoaden-, sigmoid-, sphenoid-, sympath-, synov-, tenon-, tenosynov-, trabecul-, trachel-, trisection-, trisegment-, turbin-, tyl-, tympanomastoid-, umbil-, urethr-, uvul-, vagin-, valv-, vas-, vesicul-, vulv-"

Mei Liu, Acute Kidney Injury

Mei Liu, next to one of her research publications

Requirements needed for mining the electronic health record

Technical requirements

  • Working familiarity with SQL
  • Data wrangling skills

Non-technical requirements

  • Lust for data
  • An interesting backyard

Informatics meetup

Flier for May 23 Informatics meetup

Where you can find a copy of this talk.

This presentation was developed using R Markdown. You can find all the important stuff at

In particular, look for

  • doc/mining-v2-image-credits.txt
  • doc/mining-v2-slides.pptx
  • doc/mining-v2-speaker-notes.pdf

Not part of the final talk: Introduction

I'm listing a few notes here for my benefit as I develop my talk. These notes will be cut out when I produce the final presentation.

I have been invited to give a talk at the 2019 Midwest BioInformatics Conference (note the unusual capitalization)! I will prepare the talk in this file. For now, I want to keep some material up front and center as I develop this talk. This information will be marked with the label "Not part of the final talk."

Not part of the final talk: Affiliation

I will list my affiliation as Department of Biomedical and Health Informatics, University of Missouri-Kansas City, but I also need to acknowledge Frontiers as the supporting funding.

Here's a quote from Wu 2018.

"The dataset used for analysis described in this study was obtained from KUMC’s HERON clinical data repository which is supported by institutional funding and by the KUMC CTSA grant UL1TR000001 from NCRR/NIH."

References to put in my bibliography

I want to put these in BibTeX format.

Mei Liu's paper(s) on Acute Kidney Injury. I think that the Wu paper described on the previous page or Chen 2018 would be good.

Not part of the final talk: Faculty Research Symposium

Every year, UMKC hosts a faculty research symposium, where we all get a chance to put up posters bragging about the work we are doing. It might be an opportunity to publicize the May 23 Frontiers Informatics meetup. I might instead do a poster for my work with the Center for Economic Information, but I could still have flyers to hand out about the Frontiers Informatics meetup.

Join the Office of Research Services for the 5th annual faculty research symposium, an all-faculty exchange of research, scholarship and creative activity. The Faculty Research Symposium is 2 to 4 p.m. Wednesday, April 24 in the Student Union Room 401. Online registration ends Wednesday, April 17. Contact Leslie Burgess with questions at burgessla@umkc.edu 816-235-1520. From the UMatters website.

Not part of the final talk: date and location

  • April 11-12, 2019
  • University of Missouri – Kansas City
  • Atterbury Student Success Center- Pierson Auditorium
  • 5000 Holmes, Kansas City, MO 64110

My talk is on April 11 at a session from 1pm to 2pm on the session titled "Data Structures." The other speaker listed as of 3-12-2019 is Carolyn Lawrence-Dill from the Agronomy and Genetics Department of Iowa State University. From here abstract " Our work has focused on mapping genomes and gene elements, predicting gene function, inventing new ways to link genes to phenotypic descriptions and images, developing ways to compute on phenotypic descriptions, organizing broad datasets for community access and use, and developing computational tools that enable others to do all of these sorts of analyses directly"

Not part of the final talk: format

A panel moderator will introduce the theme of the panel and the speaker. Each speaker will have 8-10 minutes. Questions will be saved for the open discussion at the conclusion of each panel.

If you exceed 10 minutes, a member of the conference team will prompt you to conclude.

Not part of the final talk: content

The conference will include participants from the full spectrum of informatics, from bioinformatics analysis of genome or proteome data to statistical analysis of clinical data related to large populations.

  • Social media – If you are on Twitter, include your handle in your opening slide.
  • Accessible - As you prepare your presentation, consider a brief explanation of the problem, technology or methodology in terms that will make your presentation accessible to the full audience.
  • Visual – Consider using visuals instead of dense text as often as possible

Not part of the final talk: Video and other special requirements

  • Please communicate with Shaylee Yount (syount@bionexuskc.org) about any of the following elements that you plan to include:
  • Video stored locally
  • Active navigation through a web site

  • Note: I do not intend to use these.

Not part of the final talk: submitting your presentation, deadline

Please send your presentation as a PowerPoint file to Shaylee Yount (syount@bionexuskc.org) by Friday March 28th at 5:00 pm.

Spread the word! – As a speaker, we hope that you will generate excitement about the conference in your community. Please encourage your students, colleagues, collaborators and community to participate. We would like to have a high level of student participation and will appreciate it if you encourage students and postdocs to submit posters. Please share the link to the conference, Tweet about your presentation using our hashtag #MWBio19.

Not part of the final talk: conference website

Not part of the final talk: My abstract

The electronic health record (EHR) offers opportunities for research and quality improvement studies that did not exist before. Data mining, discovering new and unexpected patterns in the data, requires a different mode of access for EHR data than more traditional hypothesis driven studies. This talk will cover the specialized statistical and programming skills needed for data mining.

Not part of the final talk: General structure of the talk

I only have ten minutes, so I need to be judicious in what I cover. I want to frame my talk around the famous quote by John Tukey: "The best thing about being a statistician is that you get to play in everyone's backyard."

I think I have time to slip in a bit of trivia about Dr. Tukey: he invented the boxplot and coined the terms "software" and "bit" (short for binary digit).

On the final slide, I will show the session we have planned for the Frontiers Informatics meetup (open data clinical platforms) on May 23 and refer back to the quote with something along the lines of "if you have an interesting backyard, I want to see you there."

Not part of the final talk: General structure, continued

In ten minutes, I only have time to make three points.

  1. If you are interested in data mining, you have to move past the i2b2 software and data builder, and access the data directly with SQL. This requires multiple self-joins

  2. The nature of the electronic health record creates a data structure well suited for the sparse matrix format

  3. Data mining includes mining information from the metadata.

Not part of the final talk: Things to emphasize

Place in the context of i2b2, which is an open standard.

Open Clinical Data Analytic Platform.

[Left out of final talk] Some of my backyards

Academic Emergency Medicine, American Journal of Audiology, American Journal of Kidney Diseases, Annals of Allergy, Asthma & Immunology, Annals of Behavioral Medicine, Annals of Occupational Hygiene, Applied Clinical Informatics, Applied Occupational and Environmental Hygiene, Archives of Dermatology, Archives of Pediatrics & Adolescent Medicine, Birth Defects Research. Part A, Clinical and Molecular Teratology, BMJ (Clinical Research Ed.), Cancer Letters, Cell Death & Disease, Child Maltreatment, Clinical Nephrology, Clinical Pharmacology and Therapeutics, Clinical Toxicology, Endocrinologist, Genomics, Hospital Pharmacy, Journal of Andrology, Journal of Applied Toxicology, Journal of Bone and Joint Surgery, Journal of Clinical Pharmacology, Journal of Clinical Psychology in Medical Settings, Journal of General Internal Medicine, Journal of Human Lactation, Journal of Nursing Administration, Journal of Obstetric, Gynecologic, and Neonatal Nursing, Journal of Occupational and Environmental Medicine, Journal of Pediatric Endocrinology & Metabolism, Journal of Perinatology, Journal of the Acoustical Society of America, Journal of the American Academy of Dermatology, Journal of the American Society of Echocardiography, Journal of the International AIDS Society, Molecular Andrology, Occupational Hygiene, Pediatric Blood & Cancer, Pediatric Cardiology, Pediatric Emergency Care, Pediatric Nephrology, Pediatrics, Perception & Psychophysics, Pharmacogenetics, Reproductive Biomedicine Online, Reproductive Toxicology, Scanning Microscopy, Seminars for Nurse Managers.

[Left out of final talk] Bobby tables cartoon

Cartoon describing an injection attack

[Left out of final talk] How many surgeries?

"adenoid-, adrenal-, apico-, arytenoid-, cervic-, cord-, costotransvers-, craniotomycrani-, crypt-, disk-, empyem-, epiplo-, fissur-, hemigloss-, hemilaryng-, hemiphalang-, hemispher-, hepat-, hypophys-, iridotomyirid-, kerat-, kyph-, mastoid-, mucos-, myom-, nephr-, oophor-, opercul-, orchi-, ost-, parathyroid-, pericardi-, pharyngolaryng-, plex-, pneumon-, polyp-, postadenoid-, postcholecyst-, posthyster-, postlymphaden-, posttonsill-, sphenoid-, stern-, synov-, tenosynov-, tracheostomylaryng-, tyl-, tympanosympath-, umbil-, vesicul-"

[Left out of final talk] Mining the Electronic Health Record

What is data mining?

  • "Data mining is the process of finding anomalies, patterns and correlations within large data sets" (SAS Institute)

What is the electronic health record

  • "An electronic health record (EHR) is a digital version of a patient’s paper chart. EHRs are real-time, patient-centered records that make information available instantly and securely to authorized users." (healthit.gov)